-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Refactor Cleanup Task Handler #516
base: main
Are you sure you want to change the base?
Conversation
… refactor-cleanup-task
… refactor-cleanup-task
.atWarn() | ||
.addKeyValue("batchFiles", batchFiles.toString()) | ||
.addKeyValue("tableId", tableId) | ||
.log("File batch cleanup task scheduled, but the none of the file in batch exists"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.log("File batch cleanup task scheduled, but the none of the file in batch exists"); | |
.log("File batch cleanup task scheduled, but none of the files in batch exists"); |
|
||
@Override | ||
public boolean canHandleTask(TaskEntity task) { | ||
throw new UnsupportedOperationException("This method must be implemented by subclasses."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use an abstract class then?
return LOGGER; | ||
} | ||
|
||
public static final class BatchFileCleanupTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a record
?
allDeletes.join(); | ||
} catch (Exception e) { | ||
LOGGER.error("Exception detected during batch files deletion", e); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imho this a bit too coarse-grained. The likelihood of having one failed delete will grow bigger with the number of files to delete. Shouldn't we return a more meaningful result, e.g. a statistics object reporting how many files were deleted, and how many failed? Not saying you should change this now though, because returning a boolean is imposed by TaskHandler
API, so we need to look into improving the API itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s a great point! Since this is the first time we’re introducing a batch files handler, it would be helpful to discuss how to refine the TaskHandler API for future scenarios. @collado-mike, do you have any suggestions or thoughts on this?
Thank you for the valuable feedback, @adutra ! Your comments were super helpful in improving the code quality. I’ve incorporated your suggestions and updated the code in the latest commit. |
@Override | ||
public abstract boolean handleTask(TaskEntity task); | ||
|
||
public abstract Logger getLogger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is imho not necessary, using this class' LOGGER
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is imho not necessary, using this class'
LOGGER
is enough.
Thanks for reminding, already fixed it!
// Schedule the deletion for each file asynchronously | ||
List<CompletableFuture<Void>> deleteFutures = | ||
validFiles.stream() | ||
.map(file -> super.tryDelete(tableId, authorizedFileIO, null, file, null, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'd have expected that this class actually leverages batched deletes of the underlying object store, but seems it does not.
List<String> batchFiles = cleanupTask.batchFiles(); | ||
try (FileIO authorizedFileIO = fileIOSupplier.apply(task)) { | ||
List<String> validFiles = | ||
batchFiles.stream().filter(file -> TaskUtils.exists(file, authorizedFileIO)).toList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in line 69 are exists-checks performed for each individual file, which is inefficient (slow) and also incurs unnecessary cost for cloud object storages.
I'd prefer an approach that
a) leverages object-storage batch requests
b) does not perform the same operation more than once
Description
This PR is created to refactor
ManifestFileCleanupTaskHandler.java
. The key changes include:FileCleanupTaskHandler.java
BatchFileCleanupTaskHandler.java
ManifestFileCleanupTaskHandler.java
Fixes #515
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Test Configuration:
Checklist:
Please delete options that are not relevant.